Method for processing natural language questions and apparatus thereof

ABSTRACT

A method and an apparatus for selecting an answer to a natural language question. The method includes: detecting a named entity in the natural language question; extracting information related to an answer from the natural language question; searching in linked data according to the detected named entity; generating a candidate answer according to a search result; parsing the candidate answer according to the information related to the answer, and obtaining a value of a feature of the candidate answer; and evaluating each candidate answer by synthesizing the value of the feature of the candidate answer.

CROSS REFERENCE TO RELATED APPLICATION

This application claims priority under 35 U.S.C. 119 from Chinese Patent Application 200910135368.8, filed Apr. 24, 2009, the entire contents of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to a method and an apparatus for processing natural language questions. More particularly, the present invention relates to a method and an apparatus capable of answering natural language questions using open linked structured information.

2. Description of Related Art

Question Answering (QA) has been a classical and difficult problem in the area of Artificial Intelligence over the past decades. Given a natural language question, e.g., “Justin Henry's first film role as Dustin Hoffman and Meryl Streep's son in this film earned him an Oscar nomination”, a computer system would try to return a correct answer in natural language, e.g., “Kramer vs. Kramer”, just as a human being would.

To meet the need for computer systems to process natural language questions, Natural Language Processing (NLP) techniques have been widely proposed to solve most QA problems by using unstructured data. Developing NLP techniques is reasonable because over 80% of the world's data is unstructured.

FIG. 1 illustrates a general architecture of existing QA systems. As shown in FIG. 1, a general QA system includes a question processing module 101, a document/passage retrieval module 103, and an answer processing module 105. For a natural language question raised by a user, question parsing and focus detecting are performed in the question processing module 101, which selects keywords for the question. Then the document/passage retrieval module 103 performs a keyword search in a database, and performs document filtering and passage post-filtering in a document containing the keywords, so as to generate candidate answers. Afterwards, the answer processing module 105 performs candidate identification and answer ranking on the candidate answers generated by the document/passage retrieval module 103, and finally formulates an answer to the raised natural language question, so as to output a brief answer to the user in natural language.

Moreover, QA evaluation systems have been developed to evaluate the performance of QA systems. The TREC QA track is the best known evaluation platform for QA in the world, where various datasets and question sets are provided to evaluate the accuracy and performance of different QA systems. However, with the advance of databases and the semantic Web, structured data are growing and becoming more important because they are less ambiguous than the unstructured data handled by NLP. Furthermore, most large commercial firms process structured data in their business and store them in databases without transferring them into unstructured data.

To support QA with the structured data inside corporations, new techniques have to be developed, e.g., NLDB (natural language database), which combines NLP with database technologies by providing a natural language interface over the database so that users can issue questions more easily. NLDB techniques in general depend on the syntax of the database schema, where natural language questions are translated into a few executable SQL statements in the database. They therefore restrict users to asking questions with a specific natural language grammar and return answers only within the scope of the database.

Besides databases, a lot of new structured data have appeared with the progress of realizing the semantic Web vision, e.g., RDF (Resource Description Framework) data, a form of linked data. Over RDF data, semantic query languages, e.g., SPARQL, have been proposed to query data based on semantics without depending on syntax. However, so far there is no well developed technique to process natural language questions over open linked data without the limitation of natural language grammar.

SUMMARY OF THE INVENTION

In view of the foregoing situations, the present invention provides a method, an apparatus and a computer program for processing natural language questions, to answer natural language questions of open domain and free grammar using open linked structured information.

In accordance with one aspect of the present invention, a computer implemented method for selecting an answer to a natural language question includes the steps of: detecting a named entity in the natural language question; extracting information related to an answer from the natural language question; searching in linked data according to the detected named entity; generating at least one candidate answer according to a search result; parsing the candidate answer according to the information related to the answer, and obtaining a value of a feature of the candidate answer; and evaluating each candidate answer by synthesizing the value of the feature of the candidate answer.

In accordance with another aspect of the present invention, an apparatus for selecting an answer to a natural language question includes: a question parsing module, configured to detect a named entity in the natural language question and extract information related to an answer from the natural language question; a candidate answer generating module, configured to search in linked data according to the detected named entity, and generate a candidate answer according to a search result; a feature value generating module, configured to parse the candidate answer according to the information related to the answer, and obtain a value of a feature of the candidate answer; and a candidate answer evaluating module, configured to evaluate each candidate answer by synthesizing the value of the feature of the candidate answer.

In a further aspect, the present invention provides a computer program product for implementing the above method for selecting an answer to a natural language question.

In a still further aspect, the present invention provides a computer program product which, when executed by a computer, will cause the computer to function as the system for selecting an answer to a natural language question.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be better understood from the following description when taken in conjunction with the accompanying drawings. In the accompanying drawings, the same or corresponding technical features or components are represented by the same or corresponding reference signs. The accompanying drawings together with the following detailed description are included in this specification and form a part of the specification, which are used to describe the principle and advantages of the present invention and preferred embodiments of the present invention by way of example. In the figures:

FIG. 1 illustrates a general architecture of an existing QA system;

FIG. 2 illustrates a graph structure of RDF triples;

FIG. 3 is a general flow chart of a method for processing natural language questions according to an embodiment of the present invention;

FIG. 4 is a flow chart of the step of searching in a linked database and generating a candidate answer according to an embodiment of the present invention;

FIG. 5 is a schematic structural block diagram of an apparatus for processing natural language questions according to an embodiment of the present invention;

FIG. 6 is a schematic structural block diagram of a candidate answer generating module according to an embodiment of the present invention; and

FIG. 7 is a structural block diagram of an information processing device for implementing a method for processing natural language questions according to the present invention.

It should be understood by those skilled in the art that components in the figures are shown for the purpose of simplicity and clarity only, and may not be drawn to scale. For example, some of the components in the drawings may be enlarged compared with other components, so as to improve the understanding of embodiments of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

First, a brief description of the present invention is given to provide a basic understanding of the present invention in some aspects. This introductory description is not exhaustive. It is not intended to determine a key or important part of the present invention, nor is it intended to define the scope of the present invention. The purpose of the introductory description is merely to give some ideas in a simple manner, which serves as a preface to the subsequent detailed description.

According to an aspect of the present invention, a method for processing a natural language question is provided, including: detecting a named entity in the natural language question; extracting information related to an answer from the natural language question; searching in linked data according to the detected named entity; generating a candidate answer according to a search result; parsing the candidate answer according to the information related to the answer, and obtaining a value of a feature of the candidate answer; and evaluating each candidate answer by synthesizing the value of the feature of the candidate answer.

According to a preferred embodiment of the present invention, the step of searching in linked data according to the detected named entity comprises: searching the linked data, based on similarity, for a Uniform Resource Identifier (URI) matching the named entity; and searching with spreading activation, by using the linking relationship between URIs, for a URI linked to the URI matching the named entity. Furthermore, the method includes generating the candidate answer according to the linked URI.

Preferably, candidate answers retrieved from different linked data are merged according to the feature of the candidate answers, before the step of evaluating each candidate answer by synthesizing the value of the feature of the candidate answer.

A method according to a preferred embodiment of the present invention further includes performing machine learning according to the feature of the candidate answer to train a scoring model; and computing a score for each candidate answer according to the scoring model while synthesizing the value of the feature of the candidate answer to evaluate each candidate answer.

According to another aspect of the present invention, an apparatus for processing a natural language question is provided, including: a question parsing module, configured to detect a named entity in the natural language question and extract information related to an answer from the natural language question; a candidate answer generating module, configured to search in linked data according to the detected named entity, and generate a candidate answer according to a search result; a feature value generating module, configured to parse the candidate answer according to the information related to the answer, and obtain a value of a feature of the candidate answer; and a candidate answer evaluating module, configured to evaluate each candidate answer by synthesizing the value of the feature of the candidate answer.

Exemplary embodiments of the present invention are described in conjunction with the accompanying drawings hereinafter. For the sake of clarity and conciseness, not all characteristics of practical embodiments are described in the specification. However, it should be appreciated that many embodiment-specific decisions have to be made in developing a practical embodiment in order to achieve the particular objects of the developer, e.g., compliance with system-related and service-related constraints, which may vary from one embodiment to another. Furthermore, it should also be understood that, although the development may be complex and time-consuming, it is just a routine task for those skilled in the art having the benefit of the present disclosure.

It should be further noted here that only apparatus structures and/or processing steps directly related to the solution according to the present invention are illustrated in the figures, and other details less related to the present invention are omitted, so that the present invention is not obscured by unnecessary details.

To describe the principle of the present invention, RDF data are used as an example of linked data to describe embodiments of the present invention hereinafter, because RDF data are prevalent on the Web and cover various data and knowledge. In particular, the W3C Linking Open Data (LOD) project has so far interlinked more than 30 open license datasets, which consist of over 2 billion RDF triples.

Besides the physical RDF data, virtual RDF datasets are growing as well. Many large corporations manage and process structured data inside their individual business systems and need to integrate their structured data as well. A virtual RDF view can be conveniently built on top of their structured databases using semantic Web tools such as Virtuoso, D2R and SeDA.

However, it should be understood by those skilled in the art that the present invention is not limited to RDF data, but can also be applied to various linked data, such as linked data obtained by mapping Micro-format data.

Next, DBpedia is used as a particular example of RDF, and the principle of the present invention is illustrated hereinafter by describing how the answer to the natural language question “In this 1992 Robert Altman film, Tim Robbins gets angry messages from a screenwriter he's snubbed” is obtained.

Some RDF triples in DBpedia related to the above natural language question are listed below first, and their graph structure is illustrated in FIG. 2.

 <http://dbpedia.org/resource/The_Player> <http://dbpedia.org/property/director> <http://dbpedia.org/resource/Robert_Altman> .
 <http://dbpedia.org/resource/The_Player> <http://www.w3.org/2000/01/rdf-schema#label> “The Player”@en .
 <http://dbpedia.org/resource/Gosford_Park> <http://dbpedia.org/property/director> <http://dbpedia.org/resource/Robert_Altman> .
 <http://dbpedia.org/resource/Robert_Altman> <http://dbpedia.org/property/birthPlace> <http://dbpedia.org/resource/Kansas_City%2C_Missouri> .
 <http://dbpedia.org/resource/The_Player> <http://dbpedia.org/property/starring> <http://dbpedia.org/resource/Tim_Robbins> .
 <http://dbpedia.org/resource/Tim_Robbins> <http://dbpedia.org/property/spouse> <http://dbpedia.org/resource/Susan_Sarandon> .
 <http://dbpedia.org/resource/The_Player> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/MotionPictureFilm103789400> .
 <http://dbpedia.org/class/yago/MotionPictureFilm103789400> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://dbpedia.org/class/yago/Film103435300> .
 <http://dbpedia.org/class/yago/Film103435300> <http://www.w3.org/2000/01/rdf-schema#label> “Film”@en .

In FIG. 2, circles represent URIs (Uniform Resource Identifiers) related to named entities, which are subjects and objects in the RDF triples. Lines connecting two circles indicate the relationship between the named entities, which are predicates in the RDF triples. Taking the first of the above RDF triples as an example: “<http://dbpedia.org/resource/The_Player> <http://dbpedia.org/property/director> <http://dbpedia.org/resource/Robert_Altman>”, “The_Player” and “Robert_Altman” are named entities, “<http://dbpedia.org/resource/The_Player>” is the URI related to the named entity “The_Player”, and “<http://dbpedia.org/resource/Robert_Altman>” is the URI related to the named entity “Robert_Altman”, which are therefore indicated by circles in the graph structure shown in FIG. 2. Furthermore, as the predicate in the RDF triple, “<http://dbpedia.org/property/director>” indicates the relationship between the named entities “The_Player” and “Robert_Altman”, that is, “Robert_Altman” is the “director” of the film “The_Player”. Other RDF triples can be parsed in the same way and are not listed here one by one.
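As an illustration of how triples of this form decompose into subject, predicate and object, the following is a minimal sketch in Python. The regular expression, the function name parse_triples and the tuple layout are assumptions made for this example only; they handle just the simple line format shown above (straight double quotes are assumed around literals) and are not a complete RDF parser.

```python
import re

# Matches "<subject> <predicate> <object-URI> ." or
# "<subject> <predicate> "literal"@lang ." (hypothetical, simplified pattern).
TRIPLE_RE = re.compile(
    r'<([^>]+)>\s*<([^>]+)>\s*(?:<([^>]+)>|"([^"]*)"(?:@([A-Za-z-]+))?)\s*\.'
)


def parse_triples(text):
    """Return a list of (subject, predicate, object) tuples found in text."""
    triples = []
    for match in TRIPLE_RE.finditer(text):
        subject, predicate, uri_obj, literal, lang = match.groups()
        if uri_obj is not None:
            obj = ("uri", uri_obj)          # object is another node (a circle in FIG. 2)
        else:
            obj = ("literal", literal, lang)  # object is a plain value such as a label
        triples.append((subject, predicate, obj))
    return triples


SAMPLE = '''
<http://dbpedia.org/resource/The_Player> <http://dbpedia.org/property/director> <http://dbpedia.org/resource/Robert_Altman> .
<http://dbpedia.org/resource/The_Player> <http://www.w3.org/2000/01/rdf-schema#label> "The Player"@en .
'''

triples = parse_triples(SAMPLE)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

In this sketch the subject and any URI object correspond to circles in the graph structure, and the predicate corresponds to the line connecting them.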

FIG. 3 is a general flow chart of a method for processing natural language questions according to an embodiment of the present invention. As shown in FIG. 3, the method for processing natural language questions according to an embodiment of the present invention includes named entity detection step S301, answer-related information extraction step S303, linked database retrieval step S305, candidate answer generation step S307, feature value obtaining step S309, and candidate answer evaluation step S311.

First, in step S301 for named entity detection, a natural language question inputted by the user is parsed and a named entity is detected. Next, information related to an answer is extracted from the natural language question in the answer-related information extraction step S303.

For example, from the natural language question “In this 1992 Robert Altman film, Tim Robbins gets angry messages from a screenwriter he's snubbed”, named entities “Robert_Altman” and “Tim Robbins” can be detected, and information “film” related to the type of the answer and time verification information “1992” related to the answer can be extracted.
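Steps S301 and S303 for this example can be sketched as follows. The specification does not state how named entities or answer-related information are detected, so this sketch assumes a small hypothetical name list (KNOWN_ENTITIES), a hypothetical vocabulary of answer types (ANSWER_TYPES) and a regular expression for four-digit years; a practical embodiment would likely use a trained named entity recognizer instead.

```python
import re

# Hypothetical lookup tables, used only for this illustration.
KNOWN_ENTITIES = ["Robert Altman", "Tim Robbins"]
ANSWER_TYPES = {"film", "city", "person"}


def parse_question(question):
    """Return named entities, answer type and year found in the question."""
    # S301: named entity detection by plain substring lookup (assumption).
    entities = [name for name in KNOWN_ENTITIES if name in question]

    # S303: answer type is the first word that names a known type (assumption).
    words = re.findall(r"[A-Za-z]+", question.lower())
    answer_type = next((w for w in words if w in ANSWER_TYPES), None)

    # S303: time verification information is the first four-digit year.
    year_match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", question)
    year = year_match.group(1) if year_match else None

    return {"entities": entities, "answer_type": answer_type, "year": year}


QUESTION = ("In this 1992 Robert Altman film, Tim Robbins gets angry "
            "messages from a screenwriter he's snubbed")
parsed = parse_question(QUESTION)
print(parsed)
```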

Then, in step S305 for linked database retrieval, a search is performed in different data sources, such as the linked data of DBpedia and IMDB, based on the named entities detected in named entity detection step S301. Next, in candidate answer generation step S307, a candidate answer is generated based on a search result from linked database retrieval step S305.

FIG. 4 is a flow chart of the step of searching in a linked database and generating a candidate answer according to an embodiment of the present invention. As shown in FIG. 4, first in matching step S401, the linked data are searched, based on similarity, for a URI matching a named entity. For the above exemplary natural language question, based on the named entities “Robert_Altman” and “Tim Robbins” detected in named entity detection step S301, matching URIs “<http://dbpedia.org/resource/Robert_Altman>” and “<http://dbpedia.org/resource/Tim_Robbins>” can be retrieved from DBpedia respectively.
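A minimal sketch of matching step S401 is given below. The specification only says that matching is based on similarity; the use of Python's difflib.SequenceMatcher on the last URI segment, the helper names and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

DBR = "http://dbpedia.org/resource/"
URIS = [
    DBR + "Robert_Altman",
    DBR + "Tim_Robbins",
    DBR + "The_Player",
    DBR + "Gosford_Park",
]


def normalise(text):
    """Lower-case a name and treat underscores as spaces."""
    return text.replace("_", " ").strip().lower()


def best_match(entity, uris, threshold=0.8):
    """Return the URI whose last segment is most similar to entity, or None."""
    scored = []
    for uri in uris:
        label = uri.rsplit("/", 1)[-1]
        ratio = SequenceMatcher(None, normalise(entity), normalise(label)).ratio()
        scored.append((ratio, uri))
    ratio, uri = max(scored)
    # Reject weak matches so that unknown entities do not map to a random URI.
    return uri if ratio >= threshold else None


print(best_match("Robert_Altman", URIS))
print(best_match("Tim Robbins", URIS))
```

Normalising underscores and case before comparison lets both spellings used in the example, “Robert_Altman” and “Tim Robbins”, resolve to their URIs.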

Next, in spreading activation step S403, a URI directly linked to the URI matching the named entity is searched for with spreading activation, using the linking relationship between URIs. In the example above, for the URI “<http://dbpedia.org/resource/Robert_Altman>” matching the named entity “Robert_Altman”, URIs directly linked to it can be obtained easily from the graph structure shown in FIG. 2 by spreading activation, e.g. “<http://dbpedia.org/resource/The_Player>”, “<http://dbpedia.org/resource/Gosford_Park>” and “<http://dbpedia.org/resource/Kansas_City%2C_Missouri>”. For the URI “<http://dbpedia.org/resource/Tim_Robbins>” matching the named entity “Tim Robbins”, URIs directly linked to it can also be easily obtained from the graph structure shown in FIG. 2 by spreading activation, e.g. “<http://dbpedia.org/resource/The_Player>” and “<http://dbpedia.org/resource/Susan_Sarandon>”.
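The traversal described in step S403 can be sketched as follows. This is a simplified stand-in: it uses a plain breadth-first traversal over undirected links with a hop limit, and omits the activation weights and decay that a full spreading activation implementation might use. The function names and the max_hops parameter are assumptions for this example; the triples are the resource-to-resource triples listed earlier.

```python
from collections import defaultdict, deque

DBR = "http://dbpedia.org/resource/"
DBP = "http://dbpedia.org/property/"

TRIPLES = [
    (DBR + "The_Player", DBP + "director", DBR + "Robert_Altman"),
    (DBR + "Gosford_Park", DBP + "director", DBR + "Robert_Altman"),
    (DBR + "Robert_Altman", DBP + "birthPlace", DBR + "Kansas_City%2C_Missouri"),
    (DBR + "The_Player", DBP + "starring", DBR + "Tim_Robbins"),
    (DBR + "Tim_Robbins", DBP + "spouse", DBR + "Susan_Sarandon"),
]


def build_links(triples):
    """Index triples as an undirected adjacency map between URIs."""
    links = defaultdict(set)
    for subject, _predicate, obj in triples:
        links[subject].add(obj)
        links[obj].add(subject)
    return links


def spread(links, seed, max_hops=1):
    """Return every URI reachable from seed within max_hops links."""
    seen = {seed}
    reached = set()
    frontier = deque([(seed, 0)])
    while frontier:
        uri, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in links[uri]:
            if neighbour not in seen:
                seen.add(neighbour)
                reached.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return reached


links = build_links(TRIPLES)
print(sorted(spread(links, DBR + "Robert_Altman")))
print(sorted(spread(links, DBR + "Tim_Robbins")))
```

With max_hops=1 the sketch returns only directly linked URIs, as in this embodiment; a larger value would cover the case where the search is not limited to directly linked URIs.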

After obtaining each of the above URIs by spreading activation, candidate answers can be extracted from the directly linked URIs, where a candidate answer may be a label contained in a URI. For the above example, candidate answers such as “The_Player”, “Gosford_Park”, “Kansas_City” and “Susan_Sarandon” can be extracted from the directly linked URIs obtained in spreading activation step S403. In this embodiment, URIs directly linked to the URIs matching the named entities are searched for with spreading activation, and candidate answers are generated based on the directly linked URIs. However, those skilled in the art would understand that searching with spreading activation and generating candidate answers are not limited to the directly linked URIs.
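Extracting a label from a URI can be sketched as below. The specification does not say how “Kansas_City” is derived from the URI ending in “Kansas_City%2C_Missouri”; this sketch assumes the last path segment is percent-decoded and then cut at the first comma, which is only one way to reproduce the candidate answers listed above. The function name label_from_uri is hypothetical.

```python
from urllib.parse import unquote


def label_from_uri(uri):
    """Return a candidate-answer label taken from the last URI segment."""
    segment = uri.rstrip("/").rsplit("/", 1)[-1]
    decoded = unquote(segment)      # "%2C" becomes ","
    return decoded.split(",")[0]    # assumption: drop any qualifier after a comma


LINKED_URIS = [
    "http://dbpedia.org/resource/The_Player",
    "http://dbpedia.org/resource/Gosford_Park",
    "http://dbpedia.org/resource/Kansas_City%2C_Missouri",
    "http://dbpedia.org/resource/Susan_Sarandon",
]

candidates = [label_from_uri(uri) for uri in LINKED_URIS]
print(candidates)
```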

After generating candidate answers according to the process illustrated in FIG. 4, next in the feature value obtaining step S309 shown in FIG. 3, the candidate answers are parsed based on the information related to the answer extracted in answer-related information extraction step S303, so as to obtain values of a feature of the candidate answers.

The features of the candidate answers here include the information related to the answer, and the number of directly linked URIs associated with the candidate answer. The information related to the answer includes, for example, information “film” related to the type of the answer and time verification information “1992” related to the answer. Answer-type related information may be indicated by “tycor”, and time verification information may be directly indicated by “year”. The number of directly linked URIs associated with the candidate answer is, for example, the number of URIs directly linked to the URI of the candidate answer; this feature is hereby indicated by “triple”. Accordingly, the feature values of each candidate answer for the above specific example are given in Table 1.

TABLE 1 Feature values of candidate answers

Candidate answer    Feature value
The_Player          tycor = 1    triple = 2    year = 1
Gosford_Park        tycor = 1    triple = 1    year = 0
Kansas_City         tycor = 0    triple = 1    year = 0
Susan_Sarandon      tycor = 0    triple = 1    year = 0

As can be seen from Table 1, for the feature “tycor”, the candidate answers “The_Player” and “Gosford_Park” are both film titles, consistent with the answer-type related information “film” extracted in answer-related information extraction step S303, and therefore their tycor=1. The candidate answer “Kansas_City” is a city name, and “Susan_Sarandon” is a person's name, neither of which is consistent with the answer-type related information “film”, and therefore their tycor=0. For the feature “triple”, it can be seen intuitively from FIG. 2 that the numbers of URIs directly linked to the candidate answers “The_Player”, “Gosford_Park”, “Kansas_City” and “Susan_Sarandon” and related to the named entities “Robert_Altman” and “Tim Robbins” are 2, 1, 1 and 1, respectively, and therefore their “triple” values are set to 2, 1, 1 and 1, respectively. For the feature “year”, as the time verification information “1992” extracted in answer-related information extraction step S303 is present only in the URI “<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>” linked to the candidate answer “The_Player”, the “year” value of the candidate answer “The_Player” is set to 1, and the “year” values of the other candidate answers are set to zero.
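The feature values in Table 1 can be reproduced with a short sketch. The dictionaries LINKS, TYPES and YEARS below are hypothetical in-memory stand-ins for lookups against the linked data (in particular, the year data is assumed, since the triples listed earlier do not state a year), and the function name feature_values is not taken from the specification.

```python
# URIs matched to the named entities in step S401 (labels only, for brevity).
MATCHED_ENTITIES = {"Robert_Altman", "Tim_Robbins"}

# Hypothetical stand-ins for information retrieved from the linked data.
LINKS = {
    "The_Player": {"Robert_Altman", "Tim_Robbins"},
    "Gosford_Park": {"Robert_Altman"},
    "Kansas_City": {"Robert_Altman"},
    "Susan_Sarandon": {"Tim_Robbins"},
}
TYPES = {
    "The_Player": "film",
    "Gosford_Park": "film",
    "Kansas_City": "city",
    "Susan_Sarandon": "person",
}
YEARS = {"The_Player": {"1992"}}  # assumed year data for illustration


def feature_values(candidate, answer_type, year):
    """Compute the tycor, triple and year features for one candidate."""
    return {
        # 1 if the candidate's type agrees with the expected answer type.
        "tycor": int(TYPES.get(candidate) == answer_type),
        # Number of matched named-entity URIs directly linked to the candidate.
        "triple": len(LINKS.get(candidate, set()) & MATCHED_ENTITIES),
        # 1 if the year from the question is found among the candidate's data.
        "year": int(year in YEARS.get(candidate, set())),
    }


features = {name: feature_values(name, "film", "1992") for name in LINKS}
for name, values in features.items():
    print(name, values)
```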

It should be noted that features of the candidate answers are not limited to information related to a type of the answer, the number of directly linked URIs related to the candidate answers, and time verification information related to the answer as mentioned in the above example, but may include various information relating to an answer, named entity, URI or the like, for example, linking information between URIs matching a named entity.

After obtaining the values of features of candidate answers in feature value obtaining step S309, the values of features of candidate answers can be synthesized in the candidate answer evaluation step S311, so that each of the candidate answers can be evaluated and a best answer can be provided to the user.

According to a preferred embodiment of the present invention, machine learning is performed in advance in accordance with given features of candidate answers, to obtain a satisfactory scoring model. Accordingly, when synthesizing the values of features of candidate answers in the candidate answer evaluation step S311, a score can be computed for each candidate answer using the trained scoring model, and the candidate answer with the highest score can be selected as the final answer provided to the user. Table 2 below shows the scoring results obtained by evaluating each candidate answer in the above example.

TABLE 2 Evaluation of candidate answers

Candidate answer    Feature value                              Evaluation score
The_Player          tycor = 1    triple = 2    year = 1        100
Gosford_Park        tycor = 1    triple = 1    year = 0        60
Kansas_City         tycor = 0    triple = 1    year = 0        0
Susan_Sarandon      tycor = 0    triple = 1    year = 0        0

In Table 2, for the candidate answer “The_Player”, not only does its answer type match the desired answer type, but its time verification information also conforms, and it has the largest number of directly linked URIs associated with the candidate answer; therefore, it is given the highest score of 100 and provided to the user as the best answer. For the candidate answer “Gosford_Park”, as its feature “year=0” and the number of directly linked URIs associated with the candidate answer is only 1, it is not the best answer and is given a score of 60, although its answer type matches the desired answer type. Furthermore, for the candidate answers “Kansas_City” and “Susan_Sarandon”, as both of their answer-type values are 0 and do not match the desired answer type, both of their final evaluation scores are 0.

As a matter of course, the evaluation results in Table 2 are given as examples only. In practice, different weights may be given to the features based on different situations, and evaluation of candidate answers can be performed accordingly.
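One way to synthesize feature values with weights is sketched below. The weights are hand-picked for this illustration so that the scores agree with Table 2; they are not taken from the specification and are not the output of a trained scoring model. Treating tycor as a gate (a type mismatch yields a score of 0) is likewise an assumption suggested by the explanation of Table 2.

```python
# Hand-picked illustrative weights (assumption), not a trained model.
WEIGHTS = {"base": 40, "triple": 20, "year": 20}

CANDIDATES = {
    "The_Player": {"tycor": 1, "triple": 2, "year": 1},
    "Gosford_Park": {"tycor": 1, "triple": 1, "year": 0},
    "Kansas_City": {"tycor": 0, "triple": 1, "year": 0},
    "Susan_Sarandon": {"tycor": 0, "triple": 1, "year": 0},
}


def score(features, weights=WEIGHTS):
    """Combine feature values into a single evaluation score."""
    if features["tycor"] == 0:
        return 0  # answer type does not match, so the candidate is rejected
    return (weights["base"]
            + weights["triple"] * features["triple"]
            + weights["year"] * features["year"])


def best_answer(candidates):
    """Return the candidate with the highest evaluation score."""
    return max(candidates, key=lambda name: score(candidates[name]))


scores = {name: score(values) for name, values in CANDIDATES.items()}
print(scores)
print(best_answer(CANDIDATES))
```

Changing the entries of WEIGHTS changes the ranking behaviour, which is how different situations could be accommodated in this sketch.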

Furthermore, it should also be noted that candidate answers are not necessarily obtained from the same linked data, e.g. DBpedia used in the above example. Candidate answers may be retrieved from different linked data. Therefore, if candidate answers are obtained from different linked data respectively, before evaluating the candidate answers in candidate answer evaluation step S311, the candidate answers retrieved from different linked data may be merged according to a feature of the candidate answers, so that repeated candidate answers can be avoided.
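Merging of candidate answers from different linked data might look like the following sketch. The specification does not state which feature is used for merging or how feature values are combined; here candidates are assumed to be duplicates when their normalised labels are equal, and the larger value of each feature is kept. The two source dictionaries are hypothetical examples.

```python
def normalise(label):
    """Normalise a label so that spelling variants compare equal."""
    return label.replace("_", " ").strip().casefold()


def merge_candidates(*sources):
    """Merge candidate dictionaries, keeping one entry per normalised label."""
    merged = {}
    for source in sources:
        for label, features in source.items():
            entry = merged.setdefault(
                normalise(label), {"label": label, "features": {}})
            for name, value in features.items():
                previous = entry["features"].get(name, value)
                entry["features"][name] = max(previous, value)  # assumption
    return merged


# Hypothetical candidates retrieved from two different linked data sources.
from_dbpedia = {
    "The_Player": {"tycor": 1, "triple": 2, "year": 1},
    "Gosford_Park": {"tycor": 1, "triple": 1, "year": 0},
}
from_other_source = {
    "The Player": {"tycor": 1, "triple": 1, "year": 1},
}

merged = merge_candidates(from_dbpedia, from_other_source)
for key, entry in merged.items():
    print(key, entry)
```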

The processing flow of the method for processing natural language questions according to an embodiment of the present invention is described above. The working principle of an apparatus for processing natural language questions according to an embodiment of the present invention is described hereinafter in conjunction with FIG. 5 and FIG. 6.

FIG. 5 is a structural block diagram of an apparatus 500 for processing natural language questions according to an embodiment of the present invention. As shown in FIG. 5, the apparatus 500 for processing natural language questions according to an embodiment of the present invention includes: a question parsing module 501, a candidate answer generating module 503, a feature value generating module 505, and a candidate answer evaluating module 507.

First, the question parsing module 501 parses the natural language question, detects a named entity and extracts information related to an answer from the natural language question. Then, the candidate answer generating module 503 searches in linked data such as DBpedia and IMDB according to the named entity detected by the question parsing module 501, and thereby generates candidate answers. Next, the feature value generating module 505 parses the candidate answers generated by the candidate answer generating module 503 according to the information related to the answers, and obtains values of a feature of the candidate answers. Finally, the candidate answer evaluating module 507 evaluates each candidate answer by synthesizing the values of the features of the candidate answers, and provides the best candidate answer to the user as the final result.

FIG. 6 is a schematic structural block diagram of a candidate answer generating module 600 according to a preferred embodiment of the present invention. As shown in FIG. 6, the candidate answer generating module 600 according to the embodiment includes a matching unit 601, a spreading activation unit 603 and a candidate generating unit 605.

The matching unit 601 searches the linked data, based on similarity, for a URI matching the named entity; the spreading activation unit 603 searches with spreading activation, by using the linking relationship between URIs, for a URI directly linked to the matching URI obtained by the matching unit 601; and the candidate generating unit 605 generates the candidate answers according to the directly linked URI retrieved by the spreading activation unit 603.

The candidate generating unit 605 may use a label contained in a URI as a candidate answer. The features of the candidate answers should at least include information related to the answer, and the number of directly linked URIs associated with the candidate answers. The information related to the answer at least includes the type of the answer.

According to a preferred embodiment of the present invention, the information related to the answer may further include time verification information related to the answer extracted from the natural language question, and the features of the candidate answers may further include linking information between URIs matching a named entity.

It should be noted that candidate answers are not necessarily obtained from the same linked data, but may be retrieved from different linked data. Therefore, a preferred embodiment of the present invention may include a merging module (not shown in the figure), which is configured to, if candidate answers are obtained from different linked data, merge the candidate answers retrieved from different linked data according to a feature of the candidate answers before the candidate answer evaluating module 507 evaluates the candidate answers, so that repeated candidate answers can be avoided.

In addition, the apparatus for processing natural language questions according to a preferred embodiment of the present invention may further include a training module (not shown in the figure), which is configured to perform machine learning in advance according to given features of candidate answers, so as to obtain a satisfactory scoring model. Accordingly, when the candidate answer evaluating module 507 synthesizes the values of features of candidate answers, a score can be computed for each candidate answer using the trained scoring model, and the candidate answer with the highest score can be selected as the final answer provided to the user.

It should also be noted that the detailed processing of the question parsing module 501, the candidate answer generating module 503, the feature value generating module 505, and the candidate answer evaluating module 507 in the apparatus for processing natural language questions according to the present invention is similar to that of named entity detection step S301, answer-related information extraction step S303, linked database retrieval step S305, candidate answer generation step S307, feature value obtaining step S309 and candidate answer evaluation step S311 in the method for processing natural language questions described with reference to FIG. 3, respectively. Likewise, the detailed processing of the matching unit 601, the spreading activation unit 603 and the candidate generating unit 605 in the candidate answer generating module 600 is similar to that of the matching step S401, the spreading activation step S403 and the candidate generating step S405 in the candidate answer generation method described with reference to FIG. 4. Therefore, further detailed description is omitted.

As can be seen from the description of the embodiments of the present invention and the analysis of the prior art, because natural language is extremely hard to parse well, existing QA systems over unstructured data have to deal with many ambiguities when analyzing documents, sentences and words using NLP techniques. However, the method and apparatus for processing natural language questions according to an embodiment of the present invention form a QA system over structured data, and therefore may improve the precision of QA systems by drawing on the huge amount of existing linked data.

In addition, the method and the apparatus for processing natural language questions according to an embodiment of the present invention may help corporations enable QA systems over a virtual RDF view, which is applicable to the huge amounts of RDF data and virtual RDF data generated by the corporations, without the need to change the existing QA systems.

The basic principle of the present invention is described above in conjunction with the embodiments. However, those skilled in the art should understand that each or any step or component of the method and the apparatus of the present invention may be implemented with hardware, firmware, software or a combination thereof in any computing apparatus (including processors, storage media and the like) or in a network of computing apparatuses, which can be done by those skilled in the art with basic programming skills after reading the specification of the present invention.

Therefore, the object of the present invention may also be implemented by executing a program or a series of programs on any computing apparatus. The computing apparatus can be a known general-purpose apparatus. Therefore, the object of the present invention can be implemented through program products providing program codes that implement the method or the apparatus. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium can be any known storage medium or any storage medium to be developed in the future.

In the case where the embodiments of the present invention are implemented by software and/or firmware, a program constituting the software may be installed from a storage medium or a network into a computer with dedicated hardware, for example, a general-purpose personal computer 700 as shown in FIG. 7, and the computer is capable of performing various functions when various programs are installed therein.

In FIG. 7, a Central Processing Unit (CPU) 701 performs various processing based on a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 to a Random Access Memory (RAM) 703. In the RAM 703, data required when the CPU 701 performs the various processing or the like is also stored as necessary. The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.

The following components are connected to the input/output interface 705: an input section 706 including a keyboard, a mouse, or the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), or the like, and a loudspeaker or the like; the storage section 708 including a hard disk or the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the Internet.

A drive 710 is also connected to the input/output interface 705 as necessary. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed on the drive 710 as necessary, so that a computer program read therefrom may be installed into the storage section 708 as necessary.

In the case where the above-described series of processing is implemented with software, the program that constitutes the software may be installed from a network such as the Internet or a storage medium such as the removable medium 711.

Those skilled in the art would appreciate that the storage medium is not limited to the removable medium 711 having the program stored therein as illustrated in FIG. 7, which is distributed separately from the device for providing the program to the user. Examples of the removable medium 711 include a magnetic disk (including a floppy disk), an optical disk (including a Compact Disk-Read Only Memory (CD-ROM) and a Digital Versatile Disk (DVD)), a magneto-optical disk (including a Mini-Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, the hard disk contained in the storage section 708, or the like, which has the program stored therein and is distributed to the user together with the device that contains them.

It should also be noted that, in the apparatus and method of the present invention, components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be considered as equivalent solutions of the present invention. The steps of the above-described series of processing need not necessarily be performed chronologically in the order of the description. Some steps may be performed in parallel or independently of one another.

The present invention and its advantages have been described in detail. However, it will be understood by those skilled in the art that various modifications, combinations, sub-combinations and alterations depending on design and other factors are within the scope of the appended claims. The terms “comprise”, “comprising”, “include” or any other variations thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that comprises a list of elements does not only include these elements but also may include other elements not explicitly listed or inherent to such process, method, article, or device. An element preceded by “a” or “an” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or device that comprises the element.

1. A computer implemented method for selecting an answer to a natural language question, comprising: detecting a named entity in the natural language question; extracting information related to an answer from the natural language question; searching in linked data according to the detected named entity; generating at least one candidate answer according to a search result; parsing the candidate answer according to the information related to the answer; obtaining a value of a feature of the candidate answer; and evaluating each candidate answer by synthesizing the value of the feature of the candidate answer.
2. The method according to claim 1, wherein the step of searching in linked data according to the detected named entity comprises: searching for a Uniform Resource Identifier (URI), matching to the named entity in the linked data based on similarity; and searching with spreading activation, for a URI linked to the URI matching to the named entity by using linking relationship between URIs; and generating the candidate answer according to the linked URI.
3. The method according to claim 2, wherein the candidate answer is a label contained in the URI.
4. The method according to claim 3, wherein the feature of the candidate answer comprises at least the information related to the answer, and the number of directly linked URIs associated with the candidate answer.
5. The method according to claim 4, wherein the information related to the answer comprises at least a type of the answer.
6. The method according to claim 5, wherein the information related to the answer further comprises time verification information related to the answer extracted from the natural language question, and the feature of the candidate answer further comprises linking information between URIs matching to the named entity.
7. The method according to claim 1, further comprising merging candidate answers retrieved from different linked data according to the feature of the candidate answers.
8. The method according to claim 1, further comprising: performing machine learning according to the feature of the candidate answer to train a scoring model; and computing a score for each candidate answer according to the scoring model while evaluating each candidate answer.
9. The method according to claim 1, wherein the linked data is Resource Description Framework data.
10. The method according to claim 1, wherein the linked data is obtained by mapping Micro-format data.
11. An apparatus for selecting an answer to a natural language question, comprising: a question parsing module configured to detect a named entity in the natural language question and extract information related to an answer from the natural language question; a candidate answer generating module configured to search in linked data according to the detected named entity and generate a candidate answer according to a search result; a feature value generating module configured to parse the candidate answer according to the information related to the answer and obtain a value of a feature of the candidate answer; and a candidate answer evaluating module configured to evaluate each candidate answer by synthesizing the value of the feature of the candidate answer.
12. The apparatus according to claim 11, wherein the candidate answer generating module comprises: a matching unit, configured to search for a Uniform Resource Identifier, URI, matching to the named entity in the linked data based on similarity; a spreading activation unit, configured to search with spreading activation for a URI linked to the URI matching to the named entity by using linking relationship between URIs; and a candidate generating unit, configured to generate the candidate answer according to the linked URI.
13. The apparatus according to claim 12, wherein the candidate generating unit uses a label contained in the URI as the candidate answer.
14. The apparatus according to claim 13, wherein the feature of the candidate answer comprises at least the information related to the answer, and the number of directly linked URIs associated with the candidate answer.
15. The apparatus according to claim 14, wherein the information related to the answer comprises at least a type of the answer.
16. The apparatus according to claim 15, wherein: the information related to the answer further comprises time verification information related to the answer extracted from the natural language question; and the feature of the candidate answer further comprises linking information between URIs matching to the named entity.
17. The apparatus according to claim 11, further comprising a merging module configured to merge candidate answers retrieved from different linked data according to the feature of the candidate answers.
18. The apparatus according to claim 11, further comprising a training module configured to perform machine learning according to the feature of the candidate answer to train a scoring model; and wherein the candidate answer evaluating module computes a score for each candidate answer according to the scoring model while evaluating each candidate answer.
19. The apparatus according to claim 11, wherein the linked data is Resource Description Framework data.
20. The apparatus according to claim 11, wherein the linked data is obtained by mapping Micro-format data.
21. A computer-readable storage medium tangibly embodying computer-executable program instructions which, when executed, cause a computer to perform a method for selecting an answer to a natural language question according to claim 1.
22. A computer-readable storage medium tangibly embodying computer-executable program instructions which, when executed, cause a computer to be configured to function as the apparatus for selecting an answer to a natural language question according to claim 11.